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A number of space missions dedicated to the search for exo-planets via the transit 
method, such as COROT, Eddington and Kepler, are planned for launch over the 
next few years. They will need to address problems associated with the automated 
and efficient detection of planetary transits in light curves affected by a variety of 
noise sources, including stellar variability. To maximise the scientific return of these 
^ . missions, it is important to develop and test appropriate algorithms in advance of 

their launch dates. 

^\ \ Starting from a general purpose maximum likelihood approach we discuss the 

' links between a variety of period and transit finding methods. The natural endpoint 

of this hierarchy of methods is shown to be a fast, robust and statistically efficient 
, least-squares algorithm based on box-shaped transits. 

■ This approach is predicated on the assumption of periodic transits hidden in 

random noise, usually assumed to be superposed on a flat continuum with regular 

. continuous sampling. We next show how to generalise the transit finding method to 

PIh' the more realistic scenario where complex stellar (micro) variability, irregular sampling 

' \ and long gaps in the data are all present. 

^ . Tests of this methodology on simulated Eddington light curves, including realistic 

stellar micro-variability, irregular sampling and gaps in the data record, are used to 
^ , quantify the performance. Visually, these systematic effects can completely overwhelm 

the underlying signal of interest. However, in the case where transit durations are short 
compared to the dominant timescales for stellar variability and data record segments, 
, it is possible to decouple the transit signal from the remainder. 

■ We conclude that even with realistic contamination from stellar variability, irregu- 
\ lar sampling, and gaps in the data record, it is still possible to detect transiting planets 

with an efficiency close to the idealised theoretical bound. In particular, space mis- 
sions have the potential to approach the regime of detecting earth-like planets around 
G2V-type stars. 

Key words: Exo-planets - methods: data analysis - techniques: photometric. 



1 INTRODUCTION 

The discovery of the first exo-planet orbiti nR a Sun-like star 
was a nnounced almost a decade ago by iMavor fc Quelo3 
1^9^. Since then extraordinary progress has been made, 
and the number of planets discovered to date is well beyond 
the hundred mark^. As well as probing age-old questions 
such as the existence of life beyond the Earth, these dis- 
coveries are fundamental to understanding how planets and 
planetary systems form, and whether ours is a typical one. 
The gaseous giant planets discovered so far have prompted a 
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re-thinking of planet formation theories due to their close-in 
and/or eccentric orbits. 

Among the various methods available to search for exo- 
planets, the transit method presents a number of advan- 
tages. The most immediate are that it allows direct deter- 
mination of the planet's radius relative to that of its par- 
ent star, the orbital inclination and, provided more than 
one transit is observed, the orbital period. Combined with 
radial velocity observations, a measurement of the planet 
mass free of the sini degeneracy can be obtained. The 
transit method also allows the simultaneous monitoring of 
many thousands of target stars. This multiplexing capabil- 
ity is a necessity, due to the stringent requirement on the 
alignment of the orbit with the line of sight for transits 
to occur. The first planet candidates tentatively detected 
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via the tr ansit method have be en announced over the last 
year or so JUdalski et alJ002allbl:jMallen-Ornelas et alJ20o'3 
IStreet et alJ200^1^el^le^etaLl 200?r"and one has received 
tentative radial velocity confirmation (Konacki et al. 2003()- 
The plethora of ground-based searches currently underway 
fsee lHorii3l2002l for a review) is expected to yield hundreds 
of candidate transiting giant exo-planets in the next few 
years. 

However, terrestrial planets, capable of harbouring li- 
quid water on their surface, are beyond the reach of the 
methods used so far. Detecting them is the goal of a 
number of planned space-mi ssions, such as the Franco- 
Euro pean sateUite CO ROT jB aglin fc t he COROT TeamI 
I2OO3I) . N ASA's Keyler jBoruckT et al..20o3) and ESA's Ed- 
dington llFavatall200l) . These should push the numbers of 
known exo-planets into the thousands. 

The detection of a weak, short, periodic transit sig- 
nal in noisy light curves is a challenging task. The large 
number of light curves collected make the automation and 
optimisation of the process a necessity. This requirement 
is even stronger in the context of space missions, which 
will collect even larger amounts of data and where teleme- 
try limitations will require as much of the processing to 
be done on board as possible. A number of transit detec- 
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compar e their resp ective performances in a controlled fash- 
ion ( Ti nglevI l003al) . but there is currently no widespread 
agreement on the optimal method to use. 

In a previous paper i Aigrain fc Favatall2002l hereafter 
Paper I), a dedicated Bayesian transit search algorithm was 
derived, based on t he more genera l period finding met hod of 
Gregory fc Loredo llGregorv fc Lo rcdo 1992; Greg orv|ll999l 
hereafter GL92 and G99 respectively). Here we develop this 
algorithm further and attempt to reconcile the apparent di- 
versity of the extant transit algorithms. Starting from the 
original Gregory fc Loredo prescription, which is based on 
a maximum likelihood (ML) estimation for a periodic step- 
function model of unspecified shape, appropriate sequential 
simplifications can be made. We demonstrate that the levels 
of the step-function bins - which define the shape of the de- 
tected event - are not free parameters, their optimal values 
being fully defined by the data. The use of Bayesian pri- 
ors can be dropped, given the lack of information currently 
available on the appropriate form for these priors. Finally, 
for detection purposes, the model can be simplified to an 
unequal mark-space ratio square wave with only one out-of- 
transit and one in-transit value. The algorithm itself and its 
implementation are presented in Sect.|21 The performance 
has proved better than that of the previous version, and 
the computational requirements have been significantly re- 
duced. Pursuing this simplification has also highlighted the 
similarities between the previously published transit detec- 
tion methods. 

However, ML-based algorithms are only optimised for 
data containing simple transits embedded in random noise 
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(usually well approximated by a Gaussian distribution). 
Real transit search light curves will contain instrinsic stel- 
lar variability of various amplitudes and shapes. They will 
also suffer from irregular sampling, with frequent large gaps 
in the coverage. Combined, these effects can pose a major 
threat to our ability to detect planets. This problem is il- 
lustrated, for the case of ground-based data, by recent data 
from the UNSW planet search project using the Automated 
Patrol telescope at Siding Springs observatory: in 5 nights 
of observations of the open cluster NGC6633, nearly all of 
the 1000 bri ghtest stars w ere found to be variable at the mil- 
limag level iHidas et alj !2003'l. With the even higher preci- 
sion possible with upcoming space missions (~ 0.1 mmag), 
this problem will become even more acute due to the sensi- 
tivity to additional stellar activity-induced variability. Wor- 
ries that this could seriously impair the detection of terres- 
trial planets have led to the develop ment of variability filters 
iJenkinj2002l : ICarpano et al .120031) . but these are applicable 
only to data with regular sampling and no gaps. In Sect.|21 
we introduce more generic filters applicable to irregularly 
sampled data, or data with gaps (as expected for space mis- 
sions, due for example to telemetry drop-outs). Performance 
estimation results are discussed in Sect. |1] and their impli- 
cations in Sect. |5] Finally, Appendix 1X1 contains details of 
how the simulated Eddington light curves used throughout 
the paper were generated. 



2 MAXIMUM LIKELIHOOD-BASED 
ALGORITHMS 

2.1 Maximum likelihood approach in the 
Gaussian noise case 

Transit searches are generally performed by comparing light 
curves to a family of models with a common set of parame- 
ters, differing from each other according to the different val- 
ues used for these parameters. The best set of parameters 
is identified by finding the model most likely to have given 
rise to the observed data, i.e. the model with the highest 
likelihood L. 

If the noise in each data point di is assumed to be Gaus- 
sian (an assumption also valid for Poisson noise in the limit 
of large numbers of photons), the likelihood can be written 
as the product of independent Gaussian probability distri- 
bution functions: 



n 



exp 



2(7? 



(1) 



where dt is the data value at time ti and ri is the corre- 
sponding model value. A'' is the total number of data points 
and Oi the error associated with di. Eq. Q can be rewritten 
as: 
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so that likelihood maximisation, in the case of Gaussian 
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Figure 1. Schematic representation of the family of step-function 
models used in the Gregory-Loredo method. 



noise is equivalent to minimisation, since the noise prop- 
erties ai are assumed to be known, i.e. fixed. 



2.2 The Gregory-Loredo method 

The generic method developed by Gregory & Loredo (GL92, 
G99) to detect periodic modulations in X-ray data, was used 
as the starting point of the present work. This method is 
based on a Bayesian maximum likelihood approach where 
the model consists of a periodic step function with period p, 
and m bins (labelled 1 to j) of equal duration p/m (which 
can readily be generalised to unequal duration bins if nec- 
essary). Each model is characterised by p, m, the epoch e 
(which is equal to the time t at the start of the first bin) 
and the individual bin levels Vj. Such a model is illustrated 
in Fig. The repartition of the data points into the m bins 
is defined by: 



ji = int {1 + m [ {ti + p - e) mod p ] /p} 



(4) 



where ji is the number of the bin into which the data 
point falls and int(3;) is the largest integer less than or equal 
to X. 

For a given m, p and e, the contributions from all pos- 
sible values for the individual bin levels rj are analytically 
integrated over. Individual likelihoods are then computed at 
each point in the (m, p, e) parameter space. By marginalis- 
ing over each parameter in turn, one obtains a global pos- 
terior probability for the entire family of periodic models. 
Marginalising over a parameter 9 consists of multiplying the 
(multi-dimensional) likelihood function by the (assumed) 
prior probability distribution (Bayesian prior) for 6, then 
integrating over all values of 6. This global posterior prob- 
ability can then be divided by the equivalent probabilities 
for a constant and/or aperiodic model to give an odds ratio, 
which is greater than 1 if there is significant evidence for 
periodicity. If this is the posterior probability distri- 

bution for each parameter 6 can be computed by marginal- 
ising the likelihood function over all the other parameters. 
The best value of 6 is that which gives rise to the maximum 
in the 1-D posterior probability distribution for 6. The in- 
terested reader is referred to GL92 & G99 for more details. 

We discuss in the next section how this approach can 
be modified, without loss of generality, to obviate the need 
for marginalising out the m variables rj, corresponding to 



the values of each model bin. This in turn leads to a very 
simple transit detection algorithm for the special case of 
two discrete levels, of unequal duration, applicable to most 
generic transit searches. 

2.3 Optimum calculation 

By directly maximising the likelihood, or in this case min- 
imising x^i for Biuy generalised step-function model, it is 
straightforward to show that whatever the number and rel- 
ative duration of the bins, the optimal value for the bin levels 
rj can be determined directly from the data given the other 
model parameters p, m and e. If we refer to the contribu- 
tion from bin j to the overall as Xj) and define J as the 
ensemble of indices falling into bin j, we have: 

{di - rj) 



2 



E 



(5) 



The value Vj of the model level rj that minimises Xj is 
then simply given by the standard inverse variance- weighted 
mean of the data inside bin j, since by setting dxj/drj to 
zero we have: 
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Substituting into Eq. 0, x? now becomes: 
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where Xj denotes the minimised value of Xj- The contribu- 
tion from each of the m bins can be simplified by expanding 
Eq. 0: 
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From Eq. 





we have: 
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The overall minimised x over all bins is thus: 

w 2 r 

x^-E^-E^Ti. (13) 

i = l ' j = l L is./ ' . 

The first term in Eq. I|18|l is entirely independent of the 
model, and hence stays constant, so that only the second 
term needs to be calculated for each set of trial parameters. 
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Figure 2. Schematic representation of the family of models used 
in Paper I (top), and in the present paper (bottom). 



2.4 Making use of the known characteristics of 
planetary transits 

The Gregory-Loredo method makes no assumptions about 
the shape of the variations, and is fairly computation- 
ally intensive. However, when trying to detect planetary 
transits, most of the information is concentrated in a 
very sma ll portion of the light curve. In a previous paper 
iAigrain _ fc Favat a 2002, hereafter Paper I), we adapted the 
Gregory-Loredo method to the planetary transit case by 
having one long out-of-transit bin (bin 0) and n short in- 
transit bins (see Fig. |21 top panel) . The value of n used was 
typically 4. For a given n, the parameters defining each can- 
didate model are then p, e, and the transit duration d. The 
likelihood computation was carried out as described in G99. 

This algorithm performed well when tested on simu- 
lated data^, but the likelihood calculation was still compu- 
tationally intensive. The odds ratio method was not used 
to identify light curves showing significant evidence of tran- 
sits, due to considerations detailed in Paper I. Instead, boot- 
strap simulations containing hundreds of light curves with 
different realisations of the same noise distribution, with and 
without transits, were used to define optimised detection 
thresholds in terms of posterior probability maxima. 

A number of improvements have been made since the 
publication of Paper I: 

(i) Given the current state of exo-planet research, the use 
of Bayesian priors is not expected to contribute significantly 
to the performance of the algorithm. The information avail- 
able on period and duration distributions is relatively scarce 
for giant planets, and non-existent for terrestrial planets. 
The priors used in Paper I were generic and mostly iden- 
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tical to those used by G99 for X-ray pulsars, rather than 
specifically optimised for transit searches. 

(ii) Using the rather than the likelihood as a detec- 
tion statistic, and implementing the calculation as outlined 
in Sect. 12.31 significantly reduces the computational require- 
ments of the detection process. 

(iii) The shape of most planetary transits is sufficiently 
simple that, for detection purposes (as opposed to detailed 
parameter estimation), a single in-transit bin, as illustrated 
in Fig. 121 (bottom panel) provides enough information. A 
significant advantage of this simplification is that it makes 
the method far more robust and capable of coping with real 
data, and all its concomitant problems, with negligible loss 
in detection efficiency. 

(iv) Once a detection is made, a shape-estimation phase 
with either a large value of n, or by detailed model fitting 
of the phase folded light curve, can be implemented. As the 
dependency of transit shapes as a function of the stellar 
and planetary parameters is relatively well-known, Bayesian 
priors may have a part to play in this phase. This is, however, 
outside the scope of the present paper. 

2.5 x^"miiii™isation with a box shaped transit. 

The algorithm used in the present paper evolved from that 
of Paper I, taking into consideration the points listed in 
Sect. 12.41 The model therefore consists of one out-of-transit 
bin and a single level in-transit bin. (Although this simpli- 
fication may seem disingenuous, by suitably pre-processing, 
or adaptively filtering, the signal to remove intrinsic stellar 
variability, this is a valid approximation to transit detec- 
tion in practice.) All the data points falling into the out-of- 
transit bin form the ensemble O, while those falling into the 
in-transit bin form the ensemble /. No Bayesian priors are 
used. Adapting Eq. 1131 to this model gives: 



(14) 



Provided the transits are shallow and of short duration (i.e. 
the most common case), the ensemble O contains the vast 
majority of the data points, so that do ~ d (where d is the 
weighted mean of the entire light curve). Substituting this 
approximation into Eq. 11141 : 



JV 

E 



d \ -:- 2 ' 



(15) 



The first two terms in Eq. 11511 are constant. The minimisa- 
tion of is therefore achieved by maximising the detection 
statistic Q, given by: 



which can also be expanded as: 




(16) 



(17) 



If the light curve is robustly "mean-corrected" prior to run- 
ning the algorithm, such that di is replaced by Ad;, di be- 
comes Adj, the depth of the model transit. This results in 
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a further simplification where the only free parameters are 
now the phase, period, and duration of the transit, since the 
depth is determined given the other three. It is also appar- 
ent that Q is simply equal to the square of the in-transit 
signal-to-noise ratio. This is easier to see in the case where 
ffi = a for all i (a good approximation to the case for space 
data). Eq. 1171 then becomes: 

where n/ is the number of points in /, and y^^.^^ Adj/nj 
is the mean of the in-transit points, i.e. the model transit 
depth (the weighting being unnecessary in that case). 

Eq. 11811 is used when the errors are constant, or when 
no individual error estimates are available for each data 
point. In the latter case, the Median Absolute Deviation 
(MAD) of the dataset is used to estimate a, as this is 
more ro bust to outliers tha n a simple standard error es- 
timate llHoaglin et al.lll983D . For a Gaussian distribution 
Crms = 1.48 X MAD and this factor is used throughout 
to scale the MAD sigmas. If individual error estimates are 
available, Eq. (I17t provides a more precise estimate of Q at 
the cost of a slight increase in computation time. 

If the noise is Gaussian, a theoretical signal-to-noise 
threshold (i.e. Q threshold) can in principle be computed 
a priori to keep the false alarm rate below a certain value 
J Jenkins et al.ll2002l) . 

2.6 Comparison with other transit search 
techniques 

In following through the steps of the previous sections our 
prime motives were to modify a general purpose Bayesian 
periodicity estimation algorithm to make it simpler, faster 
and more robust. In so doing we have arrived at a very 
similar formulation to that developed by other authors, 
though the details o f the implementation differ. For example, 
iKovacs et al.l (j2003) derived and tested a box-fitting method 
(BLS) similar to the present algorithm on simulated ground 
based data with white noise, and showed that significant de- 
tections followed for in-transit signal-to-noise ratios greater 
tha n 6. 

IStreet et alJ (|200^ used a transit finding algorithm 
based on a matched filter technique. After identifying and re- 
moving large amplitude variable stars they generated model 
light curves consisting of a constant out-of-transit level and 
a single in-transit section. The models were generated for a 
series of transit durations and phases, and a x'^-hke measure 
was then used to select the best model (indeed their Eq. 3 is 
essentially a special case of the method derived in Sect. 12.31 
for single transits) . 

Udalski et al^ i002lj| . who have claimed the first direct 
detections of transiting planetary candidates, also imple- 
mented a version of the BLS algorithm and noted that it 
was much more efficient than their own algorithm based on 
"a simple cross-correlation with an error-less transit light 
curve" (Udalski et aL..002a) . 

In a comp arison of several transit finding algorithms, 
iTinglev) JoOS^) found that matched filters and cross- 
correlation gave the best results compared with progres- 
sively more general methods ranging from BLS, through 



Deeg 's method jDovle et alJl200Cil to Defay's (iDefav et alJ 

[2Q0lD Bayesian approach. The fact that matched filters and 
cross-correlation methods give good results is hardly sur- 
prising, and can easily be deduced from the minimisa- 
tion developed in Sect. 12.31 Examination of Eq (jSJ shows 
that the dominant term is the cross-term ^ dirj/af, which 
needs to be maximised. The first term is a constant for a 
given dataset, while the final model term should have much 
smaller influence. The cross-term is exactly a generalised 
cross-correlation function and also identical to a matched 
filter. The more general methods suffer from the added com- 
plexity of the underlying model, which through the Bayesian 
view of Occam's Razor, reduces the tightness of the posterior 
probability distribution of the parameter estimation. What 
is however surprising, is that the BLS method did not give 
at least as good a result as the matched filter and cross- 
correlation methods. We would expect the BLS method to 
have similar performance to the matched filter as it is math- 
ematically almost identical. 



2.7 Optimised parameter space coverage 

The formulation of the detection statistic presented in 
Sect. l2.5l is fully defined given only the dataset and the start 
and end times of each model transit. The model parameters 
are thus the duration d, period p and epoch/phase e (defined 
for our purposes as the time at the start of the first transit 
in the dataset). 

The range of expected transit durations is relatively 
small - from a few hours for close-in, rapidly orbiting plan- 
ets, to almost a day for the most distant planets transiting 
more than once within the timescale of the planned observa- 
tions. A simple discrete sampling prescription can therefore 
be adopted for the duration without leading to large num- 
bers of trial values. One option is to choose the step 5d be- 
tween successive trial durations to be approximately equal 
to the average time step St between consecutive data points. 
This ensures that models with the same period and epoch 
and neighbouring trial durations differ on average by ~ 1 
data point per transit. However, if the observation sampling 
rate is high - a sampling rate of 10 min is envisa ged for 
most targets for Eddington in planet-finding mode (iFavatal 
'2003^ - a larger step in duration can be used, provided it is 
smaller than the shortest significant feature in the transit, 
namely the ingress and egress, which have typical durations 
of ~ 30 min. 

The period sampling prescription is designed to ensure 
that the error in the phase (or equivalently epoch) of the last 
model transit in the light curve is smaller than a prescribed 
value. Capping the error on the period (by using a constant 
trial period step) is not sufficient, as the error on the epoch 
of the n^^ transit will be n times the error on the epoch 
of the first. This would lead to a larger overall error for 
shorter periods, where the number of transits in the light 
curve is large, thus introducing a bias in the distribution 
of detection statistic with period. This bias is not present 
if one uses a constant step in trial frequency. Defining the 
relative frequency v — T/p, T being the total light curve 
duration, the phase of an event occurring at time t is given 
hy 6 = 2TTt/p — 2Titv/T, so that for the last transit in the 
light curve 9 « Smax = 2%^. A fixed step in u thus leads to 
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a fixed error in Smax- By trial and error, a value of 0.05 was 
found to be suitable for Si/. 

One caveat in the case of space missions with high sam- 
pling rates lasting several years, is that the above prescrip- 
tion can lead to very large numbers of trial periods. This im- 
plies that the overall algorithm must be extremely efficient. 
Some steps taken to optimise the efficiency are described 
below. 

The phase, or epoch step interval, is set to the average 
sampling rate of the data since by so doing one can generate 
the phase information at no extra computational cost using 
an efficient search algorithm, detailed below. 

2.8 A weighting scheme to account for 
non-continuous sampling 

A further complication stemming from irregular sampling 
and from the finite duration of each sample, is that data 
points nominally corresponding to a time outside a transit 
may correspond partly to the out-of transit bin and partly to 
the in-transit bin. To account for this, the indices of points 
falling either side of the transit boundaries are also stored 
and included is the calculation of Q, but with a weight which 
is < 1 and is inversely proportional to the interval between 
the time corresponding to the data point and the start /end 
time of the transit. This weighting scheme is particularly 
important for data with irregular sampling where transits 
might fall, for example, at the end of a night of ground- 
based observations, or even with spaced-based observations 
during a gap in the temporal coverage. 

2.9 Speeding up the algorithm 

By far the most time consuming operation in computing Q 
and finding the set of parameters which maximises it, is the 
identification of the in-transit points, which must be iden- 
tified for each model d, p and e. If one is dealing with a 
large number of light curves sharing the same observation 
times, it is more efficient to process many light curves si- 
multaneously and compute Q{d,p,e) for the entire block of 
light curves for each set of parameters, as follows. For each 
trial period, the time array is phase-folded. At a given trial 
duration, the in-transit points are identified for the first trial 
epoch, by stepping through the folded time array one ele- 
ment at a time until the start time of the transit is reached, 
and then continuing, storing the corresponding indices, un- 
til the end time of the transit is reached. Q{d,p,e) is then 
computed and stored for each light curve. When moving to 
the next trial epoch, one steps backward through the folded 
time array from the end time of the old transit (which is 
stored between successive trial epochs) until the start time 
of the new transit is found. One then steps forward through 
the time array, storing the indices, until the end time of 
the new transit is reached. Q{d,p,e) is then computed and 
stored, and the epoch incremented, and so forth. 

This minimises the overall number of calculations 
needed. As the number of in-transit points is the same for all 
light curves and a only needs to be computed once per light 
curve (in the constant error case), this leaves only the sum 
of the in-transit points to be computed once per set of pa- 
rameters and per light curve. The optimum number of light 



curves to process simultaneously depends on the amount of 
memory available. 

A further speed increase is obtained by noting the 
redundancy within the computation of Q for a range of 
phase/epoch and period trial values. Breaking down the 
search to a two-stage process consisting of a single transient 
event detector (essentially a matched filter stage) followed 
by a multiplexed period/phase search, removes the inner 
loop summation of data from the main search and gives a 
factor of ~ 10 improvement in execution time. 

Example run-times computed using a laptop equipped 
with a 1.2 GHz Pentium IV processor with 512 MB of RAM 
are as follows. The light curves consisted of 157 680 floating 
point numbers, i.e. each was ~ 630 KB in size. The trial 
period and duration ranges were 180 to 400 d and 0.5 to 
0.7 d respectively. These ranges are roughly appropriate to 
search for transits of planets in the habitable zone of a Sun- 
like star, and correspond to a total number of tested (p, d, e) 
combinations of ~ 5 x 10^. After finding the optimal number 
of light curves to search simultaneously, the runtime per 
light curve was ~ 4 s. 

Note that close-in planets with periods below the range 
included in this simulation are, of course, of interest, so that 
lower trial periods (and hence lower trial durations) would 
also be included when searching for transits in real data, 
thereby increasing the runtime. As the trial period range is 
increased, the number of trial periods becomes prohibitively 
large due to the use of even sampling in frequency space 
(see Sect. 12.71 : this leads to very small trial period steps at 
the low period end of the range if the steps are to be kept 
reasonable at the high period end of the range. This can 
be remedied by splitting the required range of trial periods 
and running the algorithm separately for each period inter- 
val. The runtime increases linearly with the number of trial 
durations. 



3 PRE-PROCESSING IRREGULARLY 
SAMPLED DATA 

Intrinsic variability from the planet host star is expected 
to be the dominant noise source for space-based planetary 
transit searches, and for ground-based searches in the case of 
active stars. As an example, we use throughout the present 
section a light curve simulated according to the planned 
characteristics of the Eddington mission, containing stellar 
variability, planetary transits and photon noise. The proce- 
dure used to generate this light curve is described in more 
detail in Appendix^ The light curve, shown in Fig.|21 cor- 
responds to a solar-age G2V star with apparent magnitude 
V — 13, containing transits of a 2 i?^ planet which last 
~ 13 hr and have a period of 1 yr. It has a sampling of 
10 min and a duration of 3 yr. 

Intrinsic stellar variability can seriously impede the de- 
tection of terrestrial planets by missions such as Eddington 
and Kepler. However, it is possible to disentangle the plane- 
tary transit signal from other types of temporal variability if 
the two have sufficiently different temporal characteristics. 
To illustrate this we show the power spectra of the differ- 
ent components contributing to the light curve mentioned 
above in Fig. |1] Although the power contained in the tran- 
sit signal is small compared to both stellar and photon noise 
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Figure 3. Simulated Eddington light curve for a y = 13 solar- 
age G2V star orbited by a 2 i?0 planet with a period of 1 yr (see 
Appendix^ for details). Top panel: Entire light curve. Bottom 
panel: First 30 days, with a transit 1.5 d after the start. The flux 
values shown have been normalised to have a mean of 1. 



Figure 4. Power spectrum of the light curve shown in Fig. |3 
(upper grey line). Lower grey line: stellar variability only. Lower 
black line: transits only (3 transits). Upper black line: photon 
noise. The power spectrum is dominated by stellar variability at 
low frequencies and by photon noise at high frequencies. 



components (and would be even smaller for the case of an 
Earth-size planet), it retains significant power for frequen- 
cies higher than ~ 1 /iHz, where the stellar signal starts to 
drop off steeply. As long as this condition is fulfilled (i.e. if 
the stellar variability occurs on sufficiently long timescales) , 
one should be able to separate and detect the transits. Fur- 
thermore, in the case of multiple transits, the regular period 
of the transits also helps constrain the Fourier space occu- 
pancy of the transit signal with respect to the stellar signal. 



3.1 Wiener or matched filtering approach 

ICarpano et alJ ll2003l) demonstrated how use of an optimal 
filter can simultaneously pre-whiten and enhance the vis- 
ibility of transits in data dominated by stellar variability. 
The Fourier-based method presented there is also closely 
related to a minimum mean square error (MMSE) Weiner 
filter. However, even for space-based missions uneven sam- 
pling of the data will occur. In these real-life cases, irregu- 
larly sampled data implies that standard Fourier methods 
are no longer directly applicable and a more general tech- 
nique is required. 

To gain some insight to the problem consider the general 
case of intrinsic stellar variability, with the received signal 
x{t) is composed of the three components: 



x{t) = s{t) +r{t) +n{t) 



(19) 



where s{t) is the intrinsic time variable stellar light curve, 
r{t) is the transiting planet signal, and n{t) denotes the 
measurement plus photon noise, which we can take to be 
random (and Gaussian in the cases of interest here)^. Each 

* Strictly speaking, the 1"* two terms in Eq. I19i should be multi- 
plicative, but in the limit of low amplitude variability and shallow 
transits, an additive combination is a very good approximation. 



component is statistically independent, hence the expected 
power spectrum <I>(tj) of the received signal is simply given 
by: 

|2\ I /l AT/ m2\ 



<f{u;) = {\S{u;)\') + {\R{cu)f) + {\N{cu)f) 



(20) 



and in the case of random, or white, noise (^\N{ij)\^'j is a 
constant, hence guaranteeing positivity of the right hand 
term. This also highlights in a natural way a justification for 
the somewhat arbitrary constant in Eq. (6) in lCaroano et all 
(.2003) and how its value is related to the expected noise 
properties (although it would be more natural to implement 
it as a lower bound). However, as outlined below there is a 
simpler way to implement their technique without the need 
for the additional constant. 

A standard MMSE Wiener filter attempts to maximise 
the signal-to-noise in the component of interest, in this case 
r{t), by convolving the data with a filter, h{t), constructed 
from the ratio of the cross-spectral energy densities between 
observation and target, such that: 



x'{t) = h{t) ® x{t) 



X'{ij) = H{w) X{w) 



and (using * to denote complex conjugate): 



{X{lo)X{loY) {\X{lo)\^) 



(21) 



(22) 



for a long enough run (a fair sample) of observations. In 
practice the only example we have of x{t) is often singular, 
implying that the best estimate of the denominator is simply 
the observed power spectrum $(a;), subject to the constraint 
of positivity imposed by the implicit {\N{lo)\^^ term. Such 
a filter is illustrated in Fig. |K| the top panel shows the filter, 
constructed using the Fourier transform of the light curve 
shown in Fig. El and a box-shaped reference transit of du- 
ration 0.65 d, and the bottom two panels show the filtered 
light curve. 
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Figure 5. Top panel: Wiener filter constructed using the light 
curve shown in Fig. |3] and a reference box-shaped transit of du- 
ration 0.65 d . Middle panel; Filtered light curve. Bottom panel: 
Idem, 1"* 30 days, with a transit 1.5 d after the start. 



Figure 6. Top panel: Matched filter constructed using the light 
curve shown in Fig. |3] and a reference box-shaped transit of du- 
ration 0.65 d. Middle panel: Filtered light curve. Bottom panel: 
Idem, I''* 30 days, with a transit 1.5 d after the start. 



This should be contrasted w ith the pre-whitene d 
matched detection filter employed bv lCarpano et 
illustrated in Fig. |S| (using the same layout as Fig. EJ, and 
which can be written in the form: 

X'(co) = H{cu) Xico) = (23) 

and hence is equivalent to reconstructing the data using 
just the phase of the input signal Fourier transform mod- 
ulated by the amplitude spectrum from the expected tran- 
sit shape (see Fig. O. Viewing the problem in this way re- 
moves the need for the additional constant in their Eq. (6) 
and emphasises the two stage nature of the filtering. The 
pre-whitening suppresses the stellar variability component, 
while the matched filter is directly equivalent to the n = 1 
ML case presented in Sect.|21 

In practice, transit searching can be based directly on 
the output of the filtering, or preprocessing can be used to 
decouple the stellar variation estimation from the transit 
search phase, which then proceeds using the methods out- 
lined in Sect.|5| since the problem has been reduced to the 
simpler one of transit detection in random noise. (In either 
case, detailed investigation of the transit depth and shape 
involves phase folding, unfiltered data, and local modelling.) 

Either of these preprocessing filters works well in the 
case of regularly sampled data with no gaps and with a 
reasonable separation between the signatures of the Fourier 



components of the transits and the stellar variability. In 
Figs. 13 1^ & the transits are distinctly visible in the fil- 
tered light curve. The results in terms of transit detection 
performance using either method are very similar. For sim- 
plicity, the matched filter approach, rather than the Wiener 
filter, is used in the remainder of this paper. 

However, real data, even space-based, suffers from irreg- 
ular sampling and the presence of significant gaps. Fourier 
domain methods cannot be directly applied to irregularly 
sampled data, but it is possible to treat regularly sampled 
data with gaps as a series of n independent time series, and 
to filter them separately. To test this, four arbitrarily chosen 
sections were removed from the light curve shown in Fig. 
(see Fig. |HJ . The matched filter was then applied to the five 
unbroken intervals separately, and the results are shown in 
Fig. I2I Though the filtering is effective on relatively long 
sections of data (bottom panel) it is not successful for short 
intervals (middle panel) , even if they are significantly longer 
than the transit duration. This is because the power spec- 
trum of the stellar noise is estimated from the data in order 
to construct the filter. For this to be successful, the data 
segment needs to be at least twice as long as the longest sig- 
nificant timescale in the star's variability, which is either the 
rotation p eriod or th e long end of the starspot lifetime dis- 
tribution ijAigrain e t al. 2004). In the case of the G2V star 
used in the simulations, the minimum data segment length 
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Figure 7. As Fig. P but the filtered light curve was obtained 
by modulating the phase of the Fourier transform of the data by 
the amplitude spectrum of the reference transit signal. The filter 
was omitted as it is effectively identical to that shown in Fig. IbI 
Comparing, visually, the amplitude, shape and timescale of the 
variations in the filtered data with the bottom two panels of Fig.lHl 
confirms that this gives very similar results to the matched filter 
approach. 
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Figure 8. Simulated light curve with data gaps. Four arbitrarily 
chosen sections were removed from the light curve shown in Fig.l^ 
Note that for this test the gaps were chosen to avoid the transit 
regions for comparison purposes. 

for which the filtering was successful was ~ 60 days (last 
data segment in Fig. , consistent with a rotation period of 
~ 30 days for such a star. 

It is therefore necessary to find other means of coping 
with this additional complexity. We have investigated two 
alternative approaches: one based on a least-squares gener- 
alisation of the Fourier filtering approach; the other based 
on a general purpose iterative clipped non-linear filter. In 
both cases we use the preprocessing to attempt to remove 
the stellar signature, as much as possible, prior to invoking 
the transit detection methods developed in Sect.|21 

3.2 Least-squares filtering 

For a long run of regularly sampled data, a discrete Fourier 
transform asymptotically approaches a least-sq uares fit of 
individual sine and cosine components (see e.g. lBretthors3 



Figure 9. Results of applying the matched filter independently 
to the 5 unbroken intervals of the light curve shown in Fig. |S] 
Top panel; entire filtered light curve. Middle panel: 1"* 30 days. 
Bottom panel: another 30 d section centred on the second transit 
(at 366.5 d). See text for an explanation. 

[l^SSf). This naturally suggests an extension of the approach 
described in Sect. l3.l1 to the case of irregularly sampled data. 
An analogous situation occurs in the generalisation of the 
periodogram method to Fourier estimation of periodicity; 
using generic le ast-squares sine curve fitting is a more fiexi- 
ble alternative jBrault fc Whit jll97ll) . This allows the case 
of gaps in the data, or more generally irregular sampling, to 
be dealt with in a consistent and simple manner. 

The procedure is basically identical to that employed 
for the Wiener filter described in the previous section, but 
the calculation of the Fourier transform, or power spectrum, 
of the received signal is replaced by an orthogonal decompo- 
sition of this signal into sine components whose amplitude, 
phase and zero-point are fitted by least-squares. Each of the 
components has the form: 

i^k (t) = at sin {2Tvkt/T + 4)k) (24) 

where T is the time range spanned by the data. The number 
of components to fit can be chosen such than the ma:ximum 
frequency fitted is equal to some fraction of the Nyquist 
frequency, but for this one must define an equivalent sam- 
pling time St. In the case of regular sampling with gaps, 5t 
is simply the time sampling outside the gaps. In the case of 
irregularly sampled data the definition of 5t is more open 
ended. However, provided that the sampling is close to reg- 
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Figure 10. Top panel: "Power spectrum" (i.e. coefficients a^. 
versus frequency) obtained by the least-squares fitting method 
for the light curve shown in Fig. El Middle panel: Reconstructed 
light curve, obtained by summing over the fitted sine-curves up 
to a frequency of ~ 1.8 cycles/day. Bottom panel: V^^ 30 days of 
the reconstructed light curve. 



ular, a good approximation will be the average time step 
between consecutive data points - keeping in mind that any 
significant gaps should be excluded from the calculation of 
this average. The potentially highest frequency component 
should then have frequency ~ 1/ {25t), although in practice 
a much lower frequency cutoff for the components is all that 
is required. 

Note that the first (zero-frequency) component is ef- 
fectively the mean data value {x{t)) (which can be pre- 
estimated and removed in a robust way e.g. by taking a 
clipped median). The presence of gaps in the data provides 
us with a natural way of obtaining several independent esti- 
mates of (Xis(ui)) by measuring it separately in each interval 
between gaps, or alternatively provides a natural boundary 
for doing independent light curve decompositions. 

Fig. 1101 illustrates this least-squares fitting method, as 
applied to the light curve shown in Fig. The top panel 
shows "power spectrum", i.e. the coefficients versus fre- 
quency, while the bottom two panels show the hght curve 
reconstructed by summing the fitted sine-curves. Note that 
high frequency variations are not reconstructed as only 
the first 2000 sine components were fitted (well below the 
Nyquist limit, but amply sufficient for the purposes of fol- 
lowing the long timescale stellar variability). 



Figure 11. Top panel: Equivalent matched filter constructed us- 
ing the light curve shown in Fig. |21 and a reference box-shaped 
transit of duration 0.65 d. Middle panel: Filtered light curve. Bot- 
tom panel: Filtered light curve, T'* 30 days, with a transit 1.5 d 
after the start. 



The decomposition of the reference (transit) signal can 
usually be well approximated analytically. For example if a 
simple box-shaped transit of duration d is adopted as refer- 
ence signal, the k^^ coefficient is given by; 

= ^'"("^y (25) 

However, this decomposition can also be performed in the 
same way as for the received data, for a reference signal 
of any given shape. The sets of coefficients and r-j, then 
define the filter hk, which is equivalent to the Wiener, or 
matched filter of the previous section: 



hk 



ML 

(Kl) 



(26) 



where the first expression corresponds to the standard 
Wiener filter, and the second to the filter used in 
ICarpano etaJ] l|200,1) . 

Fig llll ilhistrates this filtering method. Using the second 
expression in Eq. 126L a "matched filter" h^ (top panel) is 
constructed from the coefficients Uk and rk (the latter com- 
puted according to Eg. 1251 . The filtered light curve, obtained 
by multiplying the ak by hk and reversing the "transform" , 
is shown in the middle panel, with a zoom on the first 30 
days in the bottom panel. 

Fig. 1121 shows the results of the matched filter con- 
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Figure 13. Light curve with data gaps filtered using the non- 
Hnear technique (black curve) . The input data was the light curve 
shown in Fig. |H] The window of the iterative median filter used 
was 3 X 0.65 d. The grey curve shows the same data with the 
residual noise level after filtering measured and artificial data 
with Gaussian distributed noise of the same standard deviation 
generated to fill the gaps. This illustrates the fact that, after 
non-linear filtering, the light curve (outside the transits) is well 
approximated by a constant level plus white noise. 



Figure 12. As Fig. 1111 but the input light curve is that shown 
in Fig.|S] with 4 significant data gaps. 

structed using the least-squares fitting method when the 
light curve contains gaps (as in Fig. IHJ. The performance 
of the filter is generally not affected by the gaps, though 
artifacts near gap boundaries can sometimes be introduced. 

The case of irregular sampling is not illustrated here, for 
practical reasons: if the sampling was allowed to vary, say, 
by ±10 % of the normal sampling time in a random fashion, 
the effect is not visible in plots of such long light curves. In 
any case, we have found it to have negligible effect on the 
the least-squares filtering. 

3.3 Non-linear filtering 

If the timescale of the transits is shorter than the timescale 
for the majority of the dominant stellar variations, iterative 
non-linear time domain filters provide a powerful way of sep- 
arating out short timescale events. A good example of this 
type of approach can be based around a standard median 
filter. 

The data is first, if necessary, split into segments, using 
any significant gaps in temporal coverage to define the split 
points. These gaps, defined as missing or bad data points, or 
instances where two observations are separated in time by 
more than a certain duration, can be automatically detected. 

Each segment of data is then iteratively filtered using 
a median filter of window ~ 2 to 3 times the transit dura- 
tion, followed by a (small window) box-car filter to suppress 
level quantisation. The difference between the filtered sig- 



nal and the original is used to compute the (robust) MAD- 
estimated scatter (sigma) of the residuals. The original data 
segments are then fc-sigma clipped (with A; = 3) and the 
filtering repeated, with small gaps and subsequent clipped 
values fiagged and ignored during the median filtering oper- 
ation. The procedure converges after only a few iterations. 

Break points and/or edges are dealt with using the stan- 
dard technique of edge reflection to artificially construct 
temporary data extensions. This enables filtering to proceed 
out to the edges of all the data windows. 

The main advantage of using a non-linear filter is that 
the exact shape of the transit is irrelevant and the only free 
parameter is the typical scale size of the duration of the 
transit events. The main drawback is that the temporal in- 
formation in the segments is essentially ignored. However, 
providing the sampling within segments is not grossly irreg- 
ular this has little impact in practice. This filter is also rela- 
tively fast due to its simplicity: with the same computer as 
before, the running time for a transit duration of ~ 0.5 day 
is 4 s per light curve, about the same as the time required 
for the Wiener filter. The least-squares fitting method was 
significantly slower (requiring approximately 30 s when 1500 
frequencies were fitted). 

Figure [T^ illustrates this method as applied to the light 
curve with gaps shown in Fig. |H| As with the indirect least- 
squares filtering, the high frequency noise remains, but this 
does not impede transit detection. Given the simplicity of 
this method and its good performance in the presence of 
data gaps, it appears to be the most promising, as long as the 
sampling remains relatively regular (if the sampling is sig- 
nificantly irregular, the least-squares fitting method, which 
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Figure 14. Results of transit search after non-linear filtering. The 
input of the transit search program was the black curve shown 
in Fig. 1131 Top panel: detection statistic as a function of trial 
epoch for the preliminary single transit search (see Sect. 12.91 . 
The signature of all three transits (e = 1.5, 366.5 & 731.5 d) is 
clearly visible. Middle panel: multiple transit detection statistic as 
a function of trial period. Bottom panel: multiple transit detection 
statistic as a function of epoch at the optimal period of 365.0 d. 
The detected epoch (1.5 d) is correct. The x-axis for the top and 
bottom panels were shifted by 100 d for clarity. 



takes the time of each observation into account directly, is 
likely to perform better). 

The results of applying the transit search algorithm to 
the filtered light curve are shown in Fig. 1141 The detection 
is unambiguous (and remains so for a 1.5 7?® planet with 
otherwise identical parameters, though the detection is not 
successful for a 1 Rq planet with only 3 transits^. 

The step-like appearance and systematic slope of the 
middle panel (period determination) is due to a combination 
of the discrete (and small) number of potential transits of 
the phase estimation stage which precedes it and the search 
for a minimum (over phase) at each trial period. For each 
trial period the number of independent attempts to find a 
maximum in phase/epoch increases as the trial period in- 
creases. Furthermore, the single transit phase has negative- 
going excursions clipped out to enhance the detectability of 
real transits. This leads to a systematic bias toward higher 
maxima as a function of trial period. The steps are at har- 
monics and sub-harmonics of the fundamental period and 



The star is a 4.5 Gyr old G2 dwarf in all cases. 



4 PERFORMANCE EVALUATION 

In this section, we describe Monte Carlo simulations carried 
out to evaluate the performance of the transit detection al- 
gorithm described in Sect. 12.51 combined with the iterative 
non-linear filter introduced in Sect. 13.31 



4.1 Method 

The meth od employed was id e ntical to that described in Sec- 
tion 5.1 of lAigrain fc Favatal ll2002l). which was first u sed in 
the context of transit searches bv lDovle et alJ i2000h . The 
detection statistic (in this case the signal-to-noise ratio of 
the best candidate transit) is computed for A'^ light curves 
with transits. All light curves have the same parameters, but 
different realisations of the noise and different epochs ran- 
domly drawn from a uniform distribution (the epoch should 
not affect the detection process). The process is repeated 
for A'' transit-less light curves, which have noise characteris- 
tics identical to those of the light curves with transits. The 
chosen value of 100 for A*' is a compromise between accu- 
racy and time constraints, and suffices to give a reasonable 
estimate of the performance of the method. 

As the aim was to test the combined filtering and de- 
tection process, the light curves were subjected to the iter- 
ative nonlinear filter, before being forwarded to the transit 
detection algorithm. To avoid prohibitively time-consuming 
simulations, and thus to allow several star/planet configura- 
tions to be tested, a single transit duration value was used 
(corresponding roughly to the FWHM of the input transits) . 

Once the algorithm has been run on all the light curves, 
the next step consists in choosing a detection threshold: any 
light curve for which the maximum detection statistic ex- 
ceeds this threshold will be considered to contain a candi- 
date transit. If a transit-less light curve gives rise to a statis- 
tic above the threshold, a false positive: a candidate transit 
appears to have been detected when there is in fact none. 
Conversely, if the maximum detection statistic for a light 
curve with transits lies below the threshold, the transit (s) 
will go undetected: a false negative. 

The optimal threshold, given a set of light curves which 
are known to share the same noise characteristics, can be 
chosen from the results of the transit search itself to min- 
imise the false alarms and missed transits. This is illustrate d 
in a schematic way in Figure 3 of lAigrain fc Favatal i2002l) . 
Detection statistic histograms ideally should show a clear 
separation between real transits and false alarms, allowing 
a simple choice of boundary between the respective distribu- 
tions. The location of the boundary is chosen as a compro- 
mise between maximising the detection rate and minimising 
the number of false alarms. 

In certain circumstances, it might be more important to 
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Figure 15. Results of the performance evaluation for 5 
star/planet configurations, as detailed in Table ^ Solid his- 
tograms: distributions of the maxima of S/N statistics computed 
by the transit detection algorithm after non-linear filtering for 
100 light curves containing transits. Dashed histograms: idem for 
100 light curves containing stellar variability and photon noise 
only. Thick vertical lines: optimal detection threshold. 

minimise missed detections (for example if the sought-after 
events are very rare, particularly if false alarms can easily 
be weeded out at a later stage). In other circumstances (for 
example if it is very difficult to test the reliability of any 
candidate events through further observations) it may be 
more desirable to minimise false alarms. However, as our 
present aim is simply to carry out a simple performance 
evaluation, we did not give priority to either kind of error 
over the other and just minimised the sum of the two types 
of error. 

4.2 Results 

4-2.1 Photon-noise only case 

The aim of this simulation was to compare the performance 
of the present algorithm to others, which have mostly been 
tested on white-noise only light curves. 
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If the present method is to improve on the performance 
of the Bayesian approach it is derived from, it should be able 
to detect reliably a 1.0 planet orbiting a G2V star with 
photon noise correspo nding^ to V = 13, given three transits 
in the light curve. In lAigrain fc Favatal i20o3l . simulations 
showed that such a planet should be easily detected around 
a smaller (K5V) but fainter {V — 14) star with the older al- 
gorithm and no filtering. The V = 13, G2 case corresponds 
to a S/N that is larger by a factor of 1.07, and should there- 
fore be detected easily if the new method is as efficient as 
the old. 

After a set of simulations was run for such a configura- 
tion, the maximum detection statistic from the noise only 
light curves was S/N = 5.79, while the minimum value from 
the light curves with transits was S/N = 7.41 (see Fig. 115b ). 
Any threshold in between would therefore allow the detec- 
tion of all the transits where present, with no false alarms. 

N ote that the 5'/A'^ limit of 6, quoted bv lKovacs et alJ 
(j2002|) for their BLS method, which is statistically close to 
ours, falls as expected in the range of thresholds that would 
be suitable in the present case. 



4-2.2 Photon noise and stellar variability 

• 1.0 i?e planet orbiting a G2V star 

This configuration is identical to that in Sect. 14.2.11 but 
with stellar variability added. It is also similar to the case 
illustrated in Figs. 1^ to 1141 but with a smaller planet. The 
results are shown in Fig. 115b . The distributions of the detec- 
tion statistics from the light curves with and without tran- 
sits overlap almost entirely, i.e. the performance is poor. The 
threshold that minimises the sum of false alarms and missed 
detection leads to 56 of the first and 26 of the second. 

Assuming that the sampling rate, light curve duration, 
and stellar apparent magnitude are fixed, there are three 
factors which should lead to better performance: a larger 
planet, a shorter orbital period (i.e. more transits) or a 
smaller star. Each of these options in turn is investigated 
below. 

• 1.5 ii® planet orbiting a G2V star 

The histograms are relatively well separated (see Fig. 1151 1. 
with only a small overlap, so that the optimal threshold of 
5'/A'^ = 7.85 lead to one missed detection and no false alarms. 

It is interesting to note the similarity between the re- 
sults of this simulation and the requirements used for the 
design of the Kepler mission, which was to detect planets 
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given a signal-to-noise ratio totalling at least 8 for at least 
three transits^. 

• 1.0 i?e planet orbiting a G2V star with 6 transits 

The aim of this set of simulations was to investigate the 
effect of increasing the number of transits in the light curve 
by a factor of two by reducing the orbital period to 182 d. 
This is equivalent to increasing the overall duration of the 
observations. As expected, this leads to higher 5'/A^ values 
and hence better performance, with only 13 false alarms and 
16 missed detections (see Fig. IISH 'I . 

• 1.0 f?0 planet orbiting a K5V star 

A K5 star is smaller than a G2 star, leading to deeper tran- 
sits, but also m ore active, leading t o more stellar variability. 
Recent studies dAigrain et alJlioM) suggested that the for- 
mer effect prevailed over the latter, and that K or even M 
type stars might make better targets for space missions seek- 
ing to detect habitable planets than G stars, but these were 
based only on results from a few individual light curves, 
rather than Monte Carlo simulations. 

The present tests confirm this trend: the separation be- 
tween the with- and without transit distributions is wider 
(see Fig. I15fe ) than in the previous case, though the best- 
threshold false alarm and missed detection rates remain high 
at 13 and 25 % respectively. 

Note the higher S/N values for the transit- less light 
curves compared to the G2 case, which suggests the presence 
of more residual stellar variability after filtering, as would 
be expected. 



5 DISCUSSION 

Starting from a general purpose maximum likelihood ap- 
proach we have demonstrated the the links between a vari- 
ety of period and transit finding methods and have shown 
that matched filters, cross-correlation, least-squares fitting 
and maximum likelihood methods are all facets of the 
same underlying principle. In the simple approximation of 
rectangular-shaped transits embedded on a fiat continuum 
and in white noise, all of these approaches can be tuned to 
give similar detection results. 

The transit detection algorithm presented here provides 
a unified approach linking all these methods. Computational 
efficiency is of particular importance in the context of large, 
long duration, high sampling missions such as Eddington 
and Kepler, and the present method would allow a search 
for transits by habitable planets to be performed on 20,000 
3 yr long light curves with 10 min sampling in less than a 
day. Including the time required to apply the nonlinear filter, 
which for the laptop used takes ~ 4 s per light curve per filter 
duration, this would increase to 3 d (using three difi'erent 
filter durations). This is achieved at no cost in efficiency: 
in white noise only, the algorithm is capable of detecting 
transits d own to approx i matel y the same 5'/A'^ limit as that 
quoted bv lKovacs et al.l (|20o3) for their BLS method, which 
has been the most successful method to date in terms of 
practical results, being used by the OG LE team to discover 
most of their candidate transits, see llUdalski et Zll002bl. 

liooi). 

See www.kepler. arc .nasa. gov/slzes .html. 



This approach is predicated on the assumption of pe- 
riodic transits hidden in random noise, usually assumed to 
be superposed on a flat continuum with regular continuous 
sampling. In the real world, stellar (micro) variability is ex- 
pected to be the dominant signal component. We have then 
shown how to generalise the transit finding method to the 
more realistic scenario where complex stellar variability, ir- 
regular sampling and long gaps in the data, are all present. 

The two filtering methods developed to deal with this 
case share some advantages - both can be applied to data 
with gaps - but they also have different properties. The 
least-squares fitting method is capable of making use of 
the time information in data with irregular sampling. It 
also allows a theoretically optimal filter (i.e. the Wiener 
or matched filter) to be combined with a pre-whitening 
filter, although from the point of view of detection, the 
matched filter is the main active component of any maxi- 
mum likelihood-based detection algorithm. As a by product 
of the filtering, the stellar signal can also be reconstructed. 
However, this is computationally intensive, particularly if 
one wishes to fit higher frequencies. Its performance also de- 
pends quite critically on concordance between the duration 
of the reference transit and that of any true transit. 

On the other hand, iterative non-linear filtering is sim- 
ple to implement and fast, but ignores any local time in- 
formation (except for the long gaps which are detected au- 
tomatically). This means that its performance is likely to 
degrade if the sampling is seriously irregular. However, it 
is the most efficient method in cases such as those inves- 
tigated here. By removing any signal on timescales longer 
than two-three times the estimated transit duration, it is 
likely to be less affected by the value chosen for that du- 
ration. Although more work is needed to establish quanti- 
tatively the relative merits of the two approaches, it seems 
more efficient, given the results so far, to use the iterative 
non-linear filtering method prior to a general transit search. 
The least-squares fitting method could be employed in the 
more difficult (e.g. very irregular sampling) or borderline (as 
in Sect. 14.231 cases, where the additional information used 
about the transit shape may lead to better performance. 

Whatever the method used, there is a fundamental limit 
to what can be achieved. Stellar variability can only be fil- 
tered out if an orthogonal decomposition of the transit and 
stellar signal is possible, e.g. if the two signatures in the fre- 
quency domain do not overlap by too much. Therefore, very 
rapidly rotating stars where the rotation period is close to 
the transit duration, or stars showing much more power than 
the Sun on timescales of minutes to hours (e.g. higher meso- 
or super-granulation) will be problematic targets. Even in 
the hypothetical situation where all stellar noise is removed, 
the remaining white noise will also place a limit on the per- 
formance of the transit detection algorithm, and hence on 
the apparent magnitude of star around which transits of a 
certain depth can be found. In white Gaussian noise, any 
transit yielding a signal-to-noise ratio above a fixed thresh- 
old (estimated to be « 6 in Sect. l4.2"Tl should be detectable. 
Considering photon noise alone, for a given stellar radius, 
orbital period and transit duration, the smallest detectable 
planet radius would therefore scale as _B~^/* or exp(m/10) 
where B and m are the star's apparent brightness and mag- 
nitude respectively. 

The natural progression of this work will be further 
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quantification of tiie performances attained, and tlie identi- 
fication of the best method to use for a given situation (i.e. 
star-planet combination, instrument characteristics and/or 
sampling). As in the present paper, this can be done through 
Monte Carlo simulations, and more realistic noise profiles 
can be included in the light curves (e.g. instrumental noise). 
Extensive simulations can be performed for a given target 
field by coupling the stellar variability model to a galactic 
population model and any available extinction information 
on the field. However, it will only be meaningful to carry 
out such simulations when the design, target fields and ob- 
serving strategies of the missions in question are finalised 
and when more information about stellar micro-variability 
is available. 

Our main conclusion if that even with realistic contam- 
ination from stellar variability, irregular sampling, and gaps 
in the data record, it is still possible to detect transiting 
planets with an efficiency close to the idealised theoretical 
bound. In particular, space missions are tantalisingly close 
to being capable of detecting earth-like planets around G 
and K dwarfs. 
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APPENDIX A: SIMULATION OF REALISTIC 
EDDINGTON LIGHT CURVES 

In this appendix we briefiy outline the method used to sim- 
ulate the light curves shown in Figs.|Sl&|Hl 
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Al Planetary transits 



iDeed Jl999l) 's IDL based Universal Transit Modeller (UTM) 
was used to simulate noise-free light curves. UTM includes 
a linear limb-darkening law, and limb-darkening coefficients 
from .Van Hamme ( were used. For a given star-planet 
configuration, the other input parameters were the ratio of 
planetary to stellar radius, the planet's orbital period and 
distance, and the sampling time and duration. For the lat- 
ter, values of 10 min and 3 yr respectively were used, as 
appro priate for Eddington in planet-finding mode (,Favata 
1200311 . The output is in units of relative flux, normalised to 
an out-of-transit value of 1.0. These units are used through- 
out. Note that no reflected light from the planet is included, 
and that all orbits are assumed to be circular. The planet's 
orbital plane is also assumed to be aligned along the line-of- 
sight. 

For the current paper, we chose to model a 2 i?® planet 
orbiting a G2V star (7?* = 1.03 Rq), i.e. a radius ratio 
of 0.018, leading to a relative transit depth of 3.24 x 10~*. 
This is not the smallest detectable planet around such a star 
(with the methods presented here) , but it is the smallest for 
which transits are visible by eye in both the pre- and post- 
filtering light curves. The planet's orbital period is 1 yr, and 
its orbital distance 1 AU. The epoch of the first transit is 
1.5 d. The power spectrum of this transit-only light curve is 
shown as the black line with repeated "humps" in Fig. |1| 



A2 Intrinsic stellar variability 

The model used to simulate stellar micro- variability, which 
allows the generation of light curves for stars o f various spec- 
tral ty pes and ages, was presented in detail in lAierain et alJ 
l|2004ri . with the aim of testing and refining filtering and 
transit detection algorithms, in the context of space-based 
transit searches such as COROT, Eddington and Kepler. 

The starting point for the model is the Sun's pho- 
tometric variability, which has been studied at ultra- 
high p recision since Januar y 1996 by the VIRGO exper- 
iment ijFrohlich et al.l onboard the SoHO observa- 
tory. Empirical scaling laws, either published iSkumanidJ 
I1972|:In^ es et al.lll984r) o r derived from published datasets 
iick et al.lll987i I199rI : iRadick et alJ ll998: Hc nrv et alJ 
I200(J . for a wide range of stars), are then used to scale the 
amplitude and frequency distribution of the Sun's variability 
to other stellar ages and masses. 

Light curves can be generated for dwarfs of any spec- 
tral type between F5 and K5, and for all ages later than 
the Hyades (625 Mvr. IPerrvman et al.|[l993) . In the present 
paper, a 4.5 Gyr old G2V star was modelled, again with 
a sampling time of 10 min and duration of 3 yr. The stel- 
lar light curve, also in relative flux units (and whose power 
spectrum is shown as the lower grey line in Fig. |1J , is then 
multiplied by the planetary light curve described in lAll 

The IDL source code used to construct these, together 
with a number of existing simulated light curves, are avail- 
able from www.ast.cam.ac.uk/~suz/simlc. 



A3 Photon noise 

The Eddington baseline conflguration* , at the time of writ- 
ing, consists of four co-aligned wide-field telescopes, with a 
total collecting area of 0.764 w? . Combined with the op- 
tics and CCD performance, this leads to an expected pho- 
ton count of, for example, 1.4 x 10^ 7 s~^ for a V = 13 
star. The photon noise in relative fiux units should thus be 
well approximated by a Gaussian distribution with a nor- 
malised standard deviation of 1.09 x 10~^ for 10 min inte- 
grations, and such a randomly generated photon noise value 
was added to each data point in the combined star-planet 
light curve. The result is the light curve shown in Fig. 
while the power spectrum of the noise component is shown 
as the approximately constant black line in Fig. |1] 



A4 The above with gaps 

To investigate the impact of data gaps, the following four 
sections of data were removed from the gap-less light curve: 

• indices 4 000 to 8 999 (i.e. t = 27.8 to 62.5 d); 

• indices 55 092 to 65 060 (i.e. t = 382.6 to 451.8 d); 

• indices 110 000 to 123 009 (i.e. t = 763.9 to 854.2 d); 

• indices 140 395 to 149 999 (i.e. t = 975.0 to 1041.7 d). 

These were chosen arbitrarily, but with the aim of en- 
suring a variety of gap and data interval durations, and 
avoiding the removal of any transits. In reality, data gaps 
are of course likely to affect the number of observed tran- 
sits, but this is a different issue to that investigated here, 
i.e. the development of filters which can remove the stellar 
signal in the presence of gaps regardless of the presence (or 
lack) of transits. The resulting light curve is shown in Fig.|H| 

Note that missions like Eddington are expected to have 
a very high duty cycle (> 95 % - Favata, priv. comm.), com- 
pared to a value of ~ 70 % for the simulated light curve used 
in the present work. Such a low duty cycle is therefore even 
more conservative than the expected worst case scenario. 

This paper has been typeset from a Tp^X/ ffl^jX file prepared 
by the author. 
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